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The effort to identify genes with periodic expression during the 
cell cycle from genome- wide microarray time series data has been on- 
going for a decade. However, the lack of rigorous modeling of periodic 
expression as well as the lack of a comprehensive model for integrating 
information across genes and experiments has impaired the effort for 
the accurate identification of periodically expressed genes. To address 
the problem, we introduce a Bayesian model to integrate multiple in- 
dependent microarray data sets from three recent genome-wide cell 
cycle studies on fission yeast. A hierarchical model was used for data 
integration. In order to facilitate an efficient Monte Carlo sampling 
from the joint posterior distribution, we develop a novel Metropolis- 
Hastings group move. A surprising finding from our integrated analy- 
sis is that more than 40% of the genes in fission yeast are significantly 
periodically expressed, greatly enhancing the reported 10-15% of the 
genes in the current literature. It calls for a reconsideration of the 
periodically expressed gene detection problem. 

1. Introduction. The cell division cycle is the concerted sequence of 
processes by which a cell duplicates its DNA and divides into two daughter 
cells. Many genes are expressed periodically at a specific stage during the 
cell cycle when they peak and trough over a certain time range. They are 
termed as "cell cycle-regulated genes." Here, in the context of mRNA expres- 
sion studies, we call these "Periodically Expressed (PE) genes." In contrast, 
other genes are called "APeriodically Expressed (APE) genes." Identifica- 
tion of PE genes is both of theoretical importance because of the need to 
understand the different mechanisms underlying these genes' involvements 
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in the cell cycle processes, and of practical importance due to the biologi- 
cal links between cell cycle control and many diseases such as cancer [Sherr 
(1996); Whitfield et al. (2002); Bar- Joseph et al. (2008)]. 

With the help of the microarray techniques and various cell phase synchro- 
nization methods (synchronizing the progression of cells through the stages 
of cell cycle), researchers have conducted genome- wide time series expres- 
sion analyses on synchronized cells for various species ranging from fungi to 
plant to human [Cho et al. (1998); Spellman et al. (1998); Laub et al. (2000); 
Ishida et al. (2001); Menges et al. (2002); Whitfield et al. (2002); Rustici 
et al. (2004); Oliva et al. (2005); Peng et al. (2005); Bar- Joseph et al. (2008)]. 
Several strategies for identifying PE genes on these data have been devel- 
oped, such as the fitting of a sinusoidal function [Spellman et al. (1998)], 
clustering techniques [Eisen et al. (1998); Whitfield et al. (2002)], the single- 
pulse model [Zhao, Prentice and Breeden (2001)], the partial least squares 
regression approach [Johansson, Lindgren and Berglund (2003)], the aver- 
age periodogram [Wichert, Fokianos and Strimmer (2004)], the linear com- 
bination of cubic .B-spline basis [Luan and Li (2004)], the random-periods 
model [Liu et al. (2004)], the least square fitting for the periodic-normal 
mixture model [Lu et al. (2004)], the Fourier score combined with p- value 
of regulation [de Lichtenberg et al. (2005)], the robust spectral estimator 
combined with (^-statistic [Ahdesmaki et al. (2005)] and the up-down signa- 
ture method [Willbrand et al. (2005)]. Zhou, Wakefield and Breeden (2005) 
applied a Bayesian approach for single experiment data by fixing the period 
at pre-estimated value. Most of these methods use a set of known PE genes 
to estimate the cell cycle period prior to testing the periodicity for other 
genes. 

While the previous efforts have often reported positively about the pres- 
ence of the periodic signal in these gene expression data, doubts were raised 
as to whether such periodic gene regulation was reproducible [Shedden and 
Cooper (2002); Wichert, Fokianos and Strimmer (2004)] and, by extension, 
about the identity and count of PE genes discovered by subsequent analyses. 
One prevalent reason for skepticism is the reliance of many of the studies 
on ad hoc thresholds to classify genes as PE or otherwise. For example, Cho 
et al. (1998) detected the PE genes by visual inspection; Spellman et al. 
(1998) designed a cutoff value based on prior biological knowledge. Another 
possible reason is that the commonly assumed white noise background model 
for time series might be too unrealistic to allow correct inference about the 
identity and count of PE genes [Futschik and Herzel (2008)]. Furthermore, 
all previous approaches were designed for analyzing single time series per 
gene, which did not allow for an efficient combination of data from multiple 
experiments and therefore lacked the power to identify a large fraction of all 
PE genes. Recently Tsiporkova and Boeva (2008) proposed a procedure to 
combine multi-experiment data based on a dynamic time warping alignment 
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technique, which is potentially useful for analyzing multiple cell cycle data 
sets if combined with a periodicity detection algorithm. However, the pro- 
cedure requires each time point within a time series to be aligned to a time 
point within the other time series, which is not always appropriate when 
the lengths of cell cycle period, the sampled time ranges and the sampling 
frequencies are all different between experiments. 

Recently, three independent studies [Rustici et al. (2004); Oliva et al. 
(2005); Peng et al. (2005)] conducted elutriation and cdc25 block-release 
synchronization experiments to measure genome-wide expression in the fis- 
sion yeast (Schizosaccharomyces pombe) cell cycle. The results from these 
three studies also showed discrepancies with regard to the identity and count 
of PE genes. They reported 407, 747 and 750 PE genes, respectively, with 
only 176 genes being common to all three lists. However, the availability 
of 10 genome-wide experiments produced by these three different labs has 
made the fission yeast currently the organism with the largest cell cycle 
transcriptome data, which provides us an opportunity to obtain a better 
understanding of the cell cycle. Marguerat et al. (2006) combined the ten 
data sets from the three studies by multiplying p-values for gene regula- 
tion and periodicity from each experiment. They concluded that no more 
than about 500 PE genes can be reliably identified from the combined data. 
While observing that well over 1000 fission yeast genes could be periodically 
expressed and that each study had detected a different subset of these, they 
attributed the discrepancy to inconsistent gene naming, the use of different 
data analysis methods and the use of arbitrary thresholds. 

We investigated the PE gene identification problem by employing a Bayesian 
approach to provide (1) a more realistic and comprehensive model for the 
cell cycle time series data, and (2) an efficient and rigorous way to combine 
data from multiple experiments. A hierarchical model together with MCMC 
computation is used to integrate different sources of variation and correla- 
tion into a single coherent probabilistic framework. We applied this approach 
to integrate the ten genome-wide time series data sets. A striking finding 
from our analysis is that more than 2000 genes are significantly periodically 
expressed. This number greatly enhances the count of possible cell cycle 
regulated genes in the current literature. Most interestingly, our finding can 
be visualized clearly from Figure 3, which merely displays the original data, 
but with the genes ordered according to our inferred periodicity strength 
and peaking phase. 

2. Materials and methods. In Section 2.1 we describe the cell cycle gene 
expression data. In Section 2.2 we outline our parametric model for cell 
cycle gene expression. The Bayesian computation of the model is described 
in Section 2.3 and Section 2.4. In Section 2.5 we present our strategies for 
distinguishing PE genes from APE genes based on the model fitting results. 



Table 1 

Summary of the ten experiments for the fission yeast cell cycle 



Data set name Rustici et al. Peng et al. Oliva et al. 



Microarray type spotted PCR array spotted oligo array spotted PCR array 



Synchronization technique 
Experiment name 




elutriation 




cdc25 


elutriation 


cdc25 


elutriation 


cdc25 


Expl 


Exp2 


Exp3 


Exp4 


Exp5 


Exp6 


Exp7 


Exp8 


Exp9 


ExplO 


Number of covered gene 


4113 


3921 


4176 


4281 


4173 


4263 


4571 


4543 


4400 


4727 


Number of time point (S e ) 


20 


20 


20 


19 


18 


33 


38 


33 


50 


52 


Time point frequency (min) 


15 


15 


15 


15 


15 


10 


10 


15-21 


2-10 


10-15 



Note: 

1. The data set Rustici et al. is downloaded from http://www.sanger.ac.uk/PostGenomics/S_pombe/projects/cellcycle/. Peng 
et al. is downloaded from http://giscompute.gis.a-star.edu.sg/~gisljh/CDC/CDC_dnld_data.html. Oliva et al. is downloaded from 
http : //publications .redgreengene . com/oliva_plos_2005/. 

2. The downloaded data set Rustici et al. has been normalized on an array-by-array basis using an in-house normalization script, 
which performs three steps: masking bad spots, filtering lower quality spots, applying local window-based normalization. Peng et al. has 
filtered low intensity features (2-fold less than the background) and done LOWESS normalization within array. Oliva et al. has been 
normalized within array by the GenePix Pro software with default setting. 

3. Elutriation experiments are done to wild-type fission yeast, where samples of uniformly sized cells are obtained. Because cell size is 
correlated with cell cycle stage, these cells are synchronized with respect to their position in the cycle. Cdc25 block-and-release experiments 
are done to the fission yeast strain carrying the temperature-sensitive cdc25-22 mutant gene, where cells are initially synchronized by 
blocking them at some particular cell cycle stage, then releasing them from the block and taking samples at different times. 
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2.1. Microarray time series data. We obtained the normalized gene ex- 
pression data for ten genome-wide experiments by three cell cycle microarray 
studies [Rustici et al. (2004); Oliva et al. (2005); Peng et al. (2005)] from 
the websites listed in Table 1. For each experiment, a culture of cells is 
grown and synchronized. A set of microarrays is used to measure gene ex- 
pressions at selected time points (possibly with technical replication of the 
microarray). All values were converted to log-ratios with base 2. To make 
the log-ratios comparable across arrays, we transformed the values for every 
array separately to set the median log-ratio of each array to zero. Log-ratios 
from technical replicates, if present, were averaged. Time series with more 
than 25 percent missing entries were omitted. We unified gene names across 
the studies based on GeneDB database entries [Hertz-Fowler et al. (2004)]. 
The genes without a consistent nomenclature were excluded. 

Let Y ge t denote the gene expression log-ratio at time T e % in experiment 
e for gene g, where g = 1, . . . , G, e = 1, . . . , E, t = 1, . . . , S e . Here Y get is the 
observed data; T e t, the time of the measurement; G, the total number of 
genes studied; E, the total number of independent experiments; and S e , 
the total number of time points measured in experiment e. The whole data 
set can be visualized as a G-by-E matrix of time series, where each row 
corresponds to one gene and each column corresponds to one experiment. 
If we pool together all filtered data from the ten data sets, we have that 
G = 4994, E = 10, and S e ranges from 18 to 52. A detailed overview of the 
data is given in Table 1. For illustration, the data for two genes are shown 
in Figure 1. 

2.2. Model. We model each time series as a mean curve with additive 
independent and identically distributed (i.i.d.) Gaussian noise for measured 
time points. The mean curve is a function of time consisting of a trend 
component and a periodic component. For the trend component, we use 
a linear function along with a truncated quadratic function to model the 
block-release effect [artifacts introduced by experimental protocols for syn- 
chronization; see Lu et al. (2004)] and the general trend shown by the time 
series. We assume a first order Fourier model for the periodic component. 
A damping term is added to the periodic component to model the cell cycle 
de-synchronization effect, which implies that the periodic phenomenon even- 
tually disappears as time increases. To model the whole matrix of time series, 
we assume that the periodic components for all genes within one experiment 
share the same period, which is equal to the cell division time (i.e., duration 
between the birth of a cell up to its division into two daughter cells). We 
further assume that the relative peak time within the cell cycle for every 
gene is fixed, which allows all genes to share the same phase shift when the 
periodic components across experiments are compared. More specifically, we 



Exp1 Exp2 Exp3 Exp4 Exp5 Exp6 Exp7 Exp8 Exp9 Exp10 




100 200 100 200 100 200 100 200 100 200 50 150 300 100 300 100 300 500 200 400 200 400 



Fig. 1. Observed data and fitted mean curves for two samples of genes. For each sub-figure, the horizontal axis is the time (minutes) 
and the vertical axis is the gene expression value (log-ratio) . The first row of sub-figures shows the ten time series for a known PE gene 
(SPAPYUG7.03C). The second row is for a stress response gene (SPAC23C4-09C), which is not regulated by the cell cycle. The bullet 
dots are the observed data. They are connected by dotted lines. The solid lines are the mean curves obtained by fitting the Mi model to 
the data. The dashed lines are the mean curves obtained by fitting the Mo model to the data. The details of model fitting are given in the 
following text. 
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assume the following model (Mi) for each time series: 
Yget — Ojge ~\~ bg e T e t + c ge (mm(T et - d ge ,0)) 

+ A ge COs(/i e T e i + </> e + ^K^' + EgeU 

where 

age + b ge T et + c ge (mm(T et -d ge ,0)) 2 : trend component, 
A ge cos(n e T e t + ip e + 4> g )e~ XeTet : periodic component, 
£ fl e* ~-W(0,of e ): i.i.d. noise, 

a ge ; b 9e : coefficients of the linear trend of a time series, 
d ge - ending time of block-release effect of a time series, 
Cge- magnitude of block-release effect of a time series, 
a"g e : noise level of a time series, 

Age- amplitude of periodic component of a time series, 

H e : cell cycle angular frequency, equal to 2ir divided by the period of cell 

cycle of an experiment, 

ip e : experiment-specific phase, which models the phase-shift between two 

experiments, 

(j) g : gene-specific phase, which decides its peaking time, 

A e : magnitude of the de-synchronization effect of an experiment. 

For each gene, we use different amplitude parameter A ge for different 
experiments to account for the effects of different experimental platforms and 
synchronization techniques. If a gene is not periodic, the fitted amplitude 
Age should be close to zero. For such time series, the phase parameter <p g 
is redundant. To capture different noise levels in different experiments, we 
specify a hierarchical structure for the noise component by assuming that all 
a^e from the same experiment share the same inverse chi-square distribution 
with chosen degree of freedom C\i (a constant specified in the Appendix) 
and unknown hyper-parameters £ e : 

a 2 ge \Ce ~ InV-X 2 {Ci2,(e)- 

For convenience, we introduce the following notation: 

Y = {Yg e t, for g = 1, . . . , G; e = 1, . . . , E; t = 1, . . . , 5 e }: expression values, 
e = {fi e , i^e, A e , Ce}'- experiment-specific parameters, 

e = {&!,..., e E }, 

$ = {0i,...,0G?}: gene phases, 

Yge = {a g e,bge,Cge,dge, A ge ,(Jg e }: time-series-specific parameters, 

r = {r 1 ,...,r G }. 
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Fig. 2. Dependence structure of all variables. All links are undirected. Bullets represent a 
variable or a group of variables. Diamonds represent the dependence of the variables linked 
to it. Corresponding to the G-by-E matrix of time series, the main parameter structure 
can be visualized as a matrix, where each row corresponds to a gene-specific parameter (f> g 
and each column corresponds to experiment- specific parameters (/i e ,^/) e , A e , f e )- Each cell of 
the matrix corresponds to the variables specific to a time series. For example, all cj> g 's are 
independent of each other conditional on all ip e , A e , Ce); o, time series is independent 
of all other time series conditional on the union of cj> g and (/i e , ip e , A e , £ e ). 



All variables may be visualized within a gene-by-experiment (i.e., G x E) 
matrix (Figure 2), which shows their dependence structure. Each row cor- 
responds to a gene-specific parameter (j) g and each column represents the 
set of experiment-specific parameters (/x e , i^ e ,\ e , Ce)- Each cell of the matrix 
corresponds to the variables specific to a time series. The gene-specific pa- 
rameter (ft g is the key to integrate the time series for gene g from multiple 
experiments. Experiment-specific parameters O e are used to pool informa- 
tion across all genes within a particular experiment. 

For model comparison, we also introduce the following model (Mo) for 
APE genes: 

Yget = a ge + b ge T et + c ge (mm(T et - d ge , 0)) 2 + e get . 

The only difference between Mq (null model) and M\ (alternative model) is 
the periodic component A ge cos(/i e T et + i/; e + (p g )e~ XeTet . 

2.3. Identifiability. In the M\ model, the phase parameters ip e and 4> g 
are not identifiable because the joint posterior distribution remains the same 
if we add a constant z to all ip e and subtract z from all 4> g . This noniden- 
tifiability problem can be solved by fixing one of the phase parameters, 
but the loss of one degree of freedom makes the MCMC algorithm very 
"sticky" (slow- mixing) . Since we only care about the relative values of 1/V S 
and </> 9 's, we solve the problem by assigning a reasonably tight prior distri- 
bution to one of the phase parameters and flatter priors to others, and using 
a transformation group move to improve mixing of the MCMC chain (see 
Appendix A. 3). 
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For periodic signal fitting, the angular frequency parameter fj, e is usually 
nonidentifiable because a time series with angular frequency fx e is also a 
time series with angular frequency fi e /n for any positive integer n. We avoid 
this problem by specifying the periodic signal as a damping single sinusoidal 
curve and limiting the domain of [i e to a bounded range. The bound of [i e 
is instituted via its prior which is based on our prior knowledge of the cell 
cycle duration in fission yeast. 

2.4. Bayesian computation. We estimate all unknown parameters through 
MCMC simulation of their joint posterior distribution. More specifically, we 
use a Metropolis-within-Gibbs algorithm to iteratively sample one set of 
parameters given all the others: 

• Step 1: sample experiment-specific parameters e conditional on T and 
Y, 

• Step 2: sample gene-specific parameters <j> g conditional on 6, T and Y, 

• Step 3: sample time series-specific parameters T ge conditional on 0, $ 
and Y. 

The MCMC chain composed of these basic moves suffers from a slow 
mixing problem caused by strong correlations among some parameters. We 
can alleviate the problem by parallelizing each of the three steps based on the 
conditional independence of the parameters. For instance, we can parallelize 
the sampling of T ge from their full conditional distribution since they are 
independent of each other given 0, $ and Y . When some parameters are 
highly correlated in their joint distribution, single-component moves cause 
very slow-mixing. To cope with this problem, we designed a new sampler 
called the Metropolized independence group sampler (MIPS) by combining 
the ideas of grouping [Liu, Wong and Kong (1994)] and the Metropolized 
independence sampler [Hastings (1970); Liu (1996); Liu (2001)]. The key 
idea is to update the whole subset of correlated variables simultaneously 
independent of the current state using a sequential proposing procedure. 
MIPS moves are inserted to the main Metropolis-within-Gibbs iteration. 
The details of the MCMC implementation are given in the Appendix. 

2.5. Strategies for discerning PE genes from APE genes. We used three 
statistics to judge which genes are PE ones. Among them, the Bayesian 
Information Criterion is used to compare the fitting of model M\ with that of 
model Mo, both to real data. The other two statistics measure the periodicity 
by comparing the fitting of the M\ model to the real data with that to the 
permuted data or the data simulated from the Mq model. 
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2.5.1. Permutation test. Since we fit model M\ to every gene, even the 
APE genes are modeled with experiment-specific parameters that are 
primarily determined by PE components. Therefore, to examine the effect of 
onr Bayesian model fitting procedure on APE genes, we generate background 
data by permuting each time series for every gene in the real data, which 
destroys any periodic pattern therein. We run the same MCMC algorithm 
to fit the Mi model to the background data set by fixing all experiment- 
specific parameters at the posterior mode obtained from the MCMC run 
for the real data. 

2.5.2. Simulation from the null model. One problem of using the per- 
mutation data as background control is that the permuted time series do 
not capture the intrinsic autocorrelation of the measured time series, which 
exists even if it is not periodically expressed. For example, many time series 
in the real data show a general trend without oscillation, which may be a 
result of the gene's response to the perturbation caused by synchronization 
techniques. To accommodate this possible bias, we generate a second data 
set from the Mq model. Compared to the permuted time series, Mq explains 
the autocorrelation in the time series by a mean curve. We run the same 
MCMC algorithm to fit Mq to all genes in the real data. 

We simulated from the Mq model a data set of similar size and structure 
as the combined real data set. All parameters are simulated from their cor- 
responding prior distributions. Both M\ and Mq are fitted to this simulated 
data set. While fitting Mi, we fix all experiment-specific parameters at 
the posterior mode obtained from the MCMC run for the real data. 

2.5.3. Model comparison. One approach for discerning PE genes from 
APE genes is to use permuted data or data simulated from the null model 
as background control, and to fit the M\ model to both the real data and the 
background data. The fitting of the background data is then used to deter- 
mine a threshold for the desired false positive rate (FPR). Another approach 
is to fit both models M\ and Mq to the real data, and then do the classifica- 
tion based on a comparison of the fitness of the models. Various information 
criteria can be used for this task, such as Akaike's Information Criterion 
(AIC) [Akaike (1973)], the Bayesian Information Criterion (BIC) [Schwarz 
(1978)] and the Deviance Information Criterion (DIC) [Spiegelhalter et al. 
(2002)], to just name a few. 

A full Bayesian alternative to our approach here is to introduce a latent 
variable I g for each gene to indicate whether it comes from Mi or Mq. Then, 
the reversible-jump strategy [Green (1995)] can be used to build a MCMC 
sampler to traverse the joint space of the latent indicators and model param- 
eters. But due to the global nature of many parameters in our model, this 
approach is computationally extremely expensive. Additionally, the results 



BAYESIAN META-ANALYSIS OF CELL CYCLE GENE EXPRESSION 11 

so obtained may be too sensitive to our model assumptions. Thus, we feel 
that using randomization and null model approaches in the spirit of poste- 
rior predictive model checking [Gelman, Meng and Stern (1996)] provides a 
more robust detection of PE genes. 

2.5.4. Statistics for periodicity. We use multiple gene-specific statistics 
to measure the periodicity of a gene. Based on the fitted parameter values 
for the Mi model, we define the gene-specific Signal-to-Noise Ratio (SNR) 
as the relative strength of the fitted periodic component compared to the 
noise level: 



The SNR statistic combines periodicity information for a gene from every 
experiment in terms of the amplitude of its periodic component. For each 
gene, we calculate SNR for each iteration of the MCMC chain, and then sum- 
marize the posterior samples of SNR using the 2.5th percentile, the 97.5th 
percentile and the mean. Genes with higher SNR values are more likely to 
be periodically expressed. We also use the fitted phase to measure period- 
icity from the fitted parameters of the Mi model. More specifically, we use 
the length of the 95% central posterior interval (denoted by LPI) of a gene's 
relative phase <j) g + ip\ (ipi is chosen arbitrarily since only the difference of 
relative phase matters) as one of the periodicity measures. Genes with higher 
LPIs are less likely to be periodic either because their periodic components 
are too weak or their multiple time series might show inconsistent peaking 
time within the cell cycle. 

We use the Bayesian Information Criterion difference (BIC 01 ) to measure 
periodicity based on the fitted posterior modes of the two models. Let L® 
and Lg denote the likelihood values for gene g at the posterior mode of 
the parameters for models Mq and Mi, respectively. The model comparison 



criterion BIC 01 is defined as BICf = 2 log(Lj) - 2 log(L°) - (Jfei - k ) log(JV), 



where N is the number of observed data points for the gene, and ko and k± 
are the number of free parameters in models Mq and Mi, respectively. A 
gene with positive BIC 01 value prefers model Mi to Mo. Genes with higher 
BIC 01 values are more likely to be periodically expressed. 

3. Results and discussion. 

3.1. Model fitting check. The MCMC chain on the entire real cell cycle 
data converged in approximately 2000 iterations. The autocorrelation func- 
tion of the posterior probabilities from each chain showed that the MCMC 
algorithm is efficient in terms of effective sample sizes after burn-in. The 
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details of the model fitting diagnosis are given in the supplemental material 
of this paper [Fan, Pyne and Liu (2009)]. Figure 7 in the supplementary ma- 
terial [Fan, Pyne and Liu (2009)] displays the posterior distribution of the 
cell cycle length 2ir/ ' \x e for each of the ten experiments. After convergence, 
the experiment-specific parameters showed little variation, that is, their 
marginal posterior distributions had very small variance compared to their 
ranges. Based on the posterior mode determined from the MCMC chain, 
we calculated the residue of each time series. The autocorrelation analysis 
of the residue showed that by fitting M\ to the data, the autocorrelation 
was reduced to the level comparable to those of i.i.d. noise. Comparison of 
variance reduction between the real and the permuted data suggested that 
the Mi model explained a significant amount of variance for most of the 
genes showing significant autocorrelation in their time series. 

3.2. Number of periodically expressed genes. We ranked all genes in the 
order of decreasing posterior mean SNR value. Thus, highly ranked genes 
are more likely to be periodically expressed. We then stratified this sorted 
list into 6 groups and reordered each group according to the fitted peaking 
time. Figure 3 shows the whole sorted data set. Strikingly, a periodic pattern 
stands out for all gene groups after simply reordering them (note that these 
are simply rearranged original data). The pattern is clear and consistent 
across all experiments for the top 2000 genes, which suggests that about 40% 
of all genes in the organism could be periodically expressed. The pattern is 
still strong for genes in the range 2001-3500. We can even observe periodicity 
among the remaining genes shown in the bottom group, which, however, is 
comparable to the top ranking "genes" in the permuted data. 

For a comparison with the result from traditional clustering methods, 
the microarray clustering software Cluster [Eisen et al. (1998)] was used to 
group genes with similar gene expression. A heatmap similar to Figure 3 
is included in the supplemental material of this paper [Fan, Pyne and Liu 
(2009)]. Compared to the ubiquitous periodic pattern in Figure 3, only sev- 
eral small clusters with visible periodic pattern may be observed from the 
hierarchical clustering result. 

We used two approaches to test whether the visual periodic pattern in 
Figure 3 is statistically significant. The first approach compares the fitting 
of the M\ model to the real and background data, that is, the permuted 
data or the data simulated from the Mq model. Two statistics are used to 
measure the periodicity for this approach. The SNR statistic measures the 
amplitude of the periodic component, while the LPI statistic measures the 
uncertainty of the relative phase of every gene. Figure 4 and Figure 8 in 
the supplementary material [Fan, Pyne and Liu (2009)] show the estimated 
posterior densities of these measures. The curves from the background data 
provide a null distribution for the corresponding statistic, from which we can 
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Fig. 3. Heatmap of all genes' time series data ranked by decreasing mean SNR value. 
Columns correspond to time points, which are grouped by experiment and sorted by time 
within each group. Rows correspond to genes, which are ranked by their mean SNR value 
and sorted by their mean peak times within each group. For example, the first row group 
contains the 300 genes with the highest mean SNR value from our combined analysis of 
all 10 experiments, and they are sorted by their relative phase 4> g + tpi within the group. 
Each time series is normalized to zero mean and unit variance for display. The heatmap is 
drawn by TreeView [Eisen et al. (1998)] with default setting. Red indicates up-regulation, 
green indicates down-regulation, black means no change of expression levels, and grey is 
missing data. It shows a periodic pattern for all gene groups. 

estimate FPR for any given threshold. The clear separation of the posterior 
densities for the real and background data suggests that a lot of genes show 
a periodic pattern that is stronger than i.i.d. noise or Mq data. For example, 
by comparing the LPI curves of the real and permuted data in Figure 8 in the 
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supplementary material [Fan, Pyne and Liu (2009)], we can claim 3086 PE 
genes for FPR = 0.002, corresponding to about 10 false positives. Similarly, 
by comparing the posterior mean SNR values of the real and permuted data 
in Figure 4, we can claim 3599 PE genes for FPR = 0.002. The number of 
claimed PE genes when using the simulated data from the Mq model as 
background control is similar. For instance, the comparison of the posterior 
mean SNR densities yields 3414 PE genes for FPR = 0.002, and that of the 
LPI densities yields 3036 PE genes for FPR = 0.002. 

The second approach compares the fitting of the two models M\ and Mq , 
both using the real data. We used BIC as the model comparison criterion. 
As shown in Figure 9 in the supplementary material [Fan, Pyne and Liu 
(2009)], almost all BIC 01 values from the permuted data as well as the 
simulated data from the Mq model are smaller than zero. For the real data, 




Fig. 4. Density comparison of SNR from the three data sets. The Mi model is fitted to 
the real data, permuted data and the data simulated from the Mo model. For each gene, we 
get the posterior mean of the SNR statistic from the combined analysis. For each data set, 
we pool all genes together to get a kernel density estimate, which is shown in this graph. 
The vertical line indicates the threshold corresponding to FPR — 0. 002 in the permuted 
data, from which one can claim 3599 PE genes from the real data. 
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Table 2 

Correlation of different statistics and their classification results 



Statistic 


SNR 


LPI 


BIC 01 


SNR 


3599 


3051 


1967 


LPI 


-0.93 


3086 


1906 


BIC 01 


0.86 


-0.83 


2003 



Note: The permuted data was used as background control. The lower-left part of the table 
shows the Spearman correlation between pairs of statistics. The numbers on the diagonal 
are the number of PE genes claimed by the corresponding statistic. For SNR, we use a 
cutoff corresponding to FPR = 0.002 for the two mean SNR density. For LPI, we also use 
the threshold corresponding to FPR = 0.002. We use zero as the threshold for BIC 01 . The 
upper-right part of the table shows the number of PE genes claimed by a pair of statistics. 
Within them, 1898 genes are claimed by all three statistics. 

we can claim 2003 PE genes from the combined analysis by using zero as the 
threshold for BIC 01 . Corresponding to this threshold, the permuted data will 
only produce one false positive PE gene, corresponding to FPR = 0.0002. 

The results of these three statistics are summarized in Table 2. Here we 
used the permuted data as background control. The average Spearman cor- 
relation between pairs of the statistics is 0.87, suggesting that the three 
statistics are highly consistent in ranking the genes' periodicity. The ap- 
proaches based on permutation control (SNR, LPI) made more significant 
claims than the model selection approach. Overall, we obtained a list of 1898 
significant PE genes that are claimed by all three statistics. 

3.3. Performance comparison. To evaluate the performance of identify- 
ing PE genes, we defined a benchmark set as the union set of the list of PE 
genes derived from small-scale experiments [Marguerat et al. (2006)] and a 
core set of genes whose periodic regulation is conserved between budding 
yeast and fission yeast [Lu et al. (2007)]. The resulting benchmark set con- 
sists of 162 genes. We used this benchmark set to compare our method with 
the method used by Marguerat et al. (2006). 

The statistic used for gene classification by Marguerat et al. (2006) is 
a score calculated from a p-value of regulation and a p-value of periodic- 
ity. When combining multiple experiments for gene classification, they mul- 
tiplied the p-values from individual experiments to get a total p-value of 
regulation and a total p-value of periodicity. To estimate the FPR of their 
statistic, we calculated the scores for the permuted data. For our method, 
we use the SNR statistic for gene classification. 

Figure 5 shows the performance of the SNR statistic and Marguerat et 
al.'s score on both the combined data (all experiments) and the Expl data (a 
single experiment) in the form of ROC curves. For any given FPR value, we 



16 



X. FAN, S. PYNE AND S. LIU 



0.5 0.6 


i 

t 

\ 

\ 

\ 

\ 


SNR from combined data 

Marguerat et al.'s score from combined data 

SNR from Exp1 data 
— Marguerat et al.'s score from Exp1 data 


d 




\ 


CO 

d 




— ' \ 


CM 


\ 

\ 

\ 




d 


\ 

V 




d 






o 


\ 




d 





— I 1 1 1 1 1 — 

0.00 0.02 0.04 0.06 0.08 0.10 

FPR 

Fig. 5. Performance on the benchmark set. For each of the four methods listed in the 
figure legend, we plot FNR against FPR under various thresholds. For each threshold, the 
benchmark set of 162 PE genes is used to estimate FNR. The permuted version of the data 
is used to estimate FPR. A smaller under-curve area corresponds to a better classification 
performance for the benchmark set. 

estimate the threshold of a statistic from the permuted version of the data. 
The corresponding false negative rate (FNR) is estimated by the fraction of 
the genes in the benchmark set that are classified as APE gene according to 
this threshold. When applied on the data from a single experiment (Expl 
data), the SNR statistic apparently outperforms Marguerat et al.'s score. 
The gain of statistical power at the single experiment level could be due to 
our explicit modeling of the trend component and the de-synchronization 
effect, which makes our model more realistic for the cell cycle time series. 
When comparing their performances on the combined data, it seems that 
the SNR statistic increases the statistical power over Marguerat et al.'s 
score significantly. This is due not only to a more realistic model for single 
time series, but also to our approach of the Bayesian meta-analysis. Instead 
of combining the p-values from individual experiments, we model multiple 
experiments simultaneously so as to borrow information across experiments. 
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Figure 5 indicates that the same statistic performed better at discerning 
PE genes with the combined data than with the data from a single experi- 
ment. This is also true when comparing the performances of a statistic using 
the overall combined versus that using any subset of the experiments. The 
detailed information is given in Table 2 in the supplementary material [Fan, 
Pyne and Liu (2009)]. This is natural because any subset contains less infor- 
mation than the full combined data; but on the other hand, it also indicates 
that each experiment captured some information about genes' periodicity 
during the cell cycle. 

3.4. Subset analysis. To compare three individual studies [Rustici et al. 
(2004); Oliva et al. (2005); Peng et al. (2005)] and different experimental 
techniques, we used the same method for the combined data set to fit model 
Mi to all three individual data sets, and also the two collections of ex- 
periments using different synchronization techniques (elutriation or cdc25 
block- release) . We first determined the 95% posterior interval of the SNR 
statistic for each gene to account for the uncertainty of its SNR estimate. 
Then for comparison of all the subsets at the same significance level, we 
claim a gene to be PE if its posterior mean SNR value is above the upper 
97.5% posterior limits of the SNR of at least 4984 (out of 4994) permuted 
"genes." For the combined data, we thus claimed 2032 PE genes. Figure 6(a) 
and Figure 6(b) show the overlap of the results from our subset analyses. 
Figure 6(c) shows the overlap of the original results from the three individ- 
ual studies. There are 976 genes which are reported as PE by our combined 
analysis but not by any of the three original studies. Supporting evidences 
for these genes are included in the supplementary material [Fan, Pyne and 
Liu (2009)]. 

Similar to Figure 6(c), the discrepancy about the count and identity of PE 
genes exists between individual data sets [Figure 6(a)] and across synchro- 
nization techniques [Figure 6(b)] although we have unified the whole analy- 
sis procedure. Therefore, instead of attributing the discrepancy between the 
subsets to inconsistent gene naming or use of different analysis methods or 
arbitrary thresholds [Marguerat et al. (2006)], we suggest that the cause is 
intrinsic to the data. It also shows that most genes in the discrepant part 
show significant periodicity in the combined analysis. The combined anal- 
ysis also captured many genes which can not be detected by subset data 
analysis. Combined with the benchmark analysis, we observed that 5 out of 
the 40 benchmark genes whose periodicity have been confirmed by small- 
scale experiments [Marguerat et al. (2006)] were missed by all three original 
studies as well as our combined analysis. On the other hand, 6 out of the 92 
core environmental stress response genes with known function [Chen et al. 
(2003)] were claimed as periodically expressed by all three original studies 
as well as by our combined analysis, suggesting that their periodic signal 
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Rustici ct al 



Combined only 



Peng et al 




Combined only 



Oliva et al. 




(a) 



(b) 



Rustici et al. 



Combined only 



Peng cl al 




Oliva ct al. 



(c) 



Fig. 6. Venn diagrams showing overlap between claimed PE genes from subsets of the 
data. Each gene set in all diagrams is compared with the result from the combined analysis 
that we did using our method. The number before the plus sign is the number of genes 
also claimed as periodically expressed by our combined analysis. The stand-alone circle 
represents the part which is reported only by the combined analysis, (a) Comparing the 
results from individual data sets using our method, (b) Comparing the results from two 
synchronization techniques using our method, (c) Comparing the results reported in original 
studies. 



is clear to all methods. Possibly, the periodicity measure for widely used 
positive or negative benchmark sets are not quite accurate. 

To investigate the discrepancy between different subsets, we systemati- 
cally tested these subsets' pairwise reproducibility using the posterior mean 
SNR values. If it is true that the genes have an intrinsic order in terms of 
periodicity and all individual data sets are of similar quality in revealing this 
ordering information, the periodicity measures across pairs of subsets should 
be consistent. Each data set yields a SNR vector measuring the periodicity 
of all genes. The key idea is to check whether the Spearman correlation of 
the two SNR vectors is still significant after removing genes which are top 
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ranked in both vectors. The details are shown in Figure 10 in the supplemen- 
tary material [Fan, Pyne and Liu (2009)]. After removing the 847 genes that 
are highly ranked by both Peng et al. and Oliva et al., the remaining genes' 
SNR values from these two data sets show no positive Spearman correlation 
at the significance level of 0.05. This sets the number of reproducible genes 
supported by these two data sets (5 experiments) to 847. This same count 
increases to 934 for Rustici et al. versus Peng et al. (7 experiments), and 
to 1008 for Rustici et al. versus Oliva et al. (8 experiments). The increas- 
ing of reproducible genes is consistent with the increase in the size of data 
involved in comparison. The number further increases to 1554 when com- 
paring elutriation experiments with cdc25 experiments. This suggests that 
although the number of reproducible genes is less than the number of PE 
genes suggested by the combined analysis, the reproducibility is improved by 
including more data in the comparison or by partitioning the data according 
to the experiment technique. 

To explain the above subset discrepancy, possible flaws in the benchmark 
sets and the high number of significant genes in the combined analysis, we hy- 
pothesize a network-based dynamics for the cell cycle process. For instance, 
periodic signals from transcription of key cell cycle-regulated genes propa- 
gate through the relevant downstream regulatory networks of the organism 
potentially targeting a considerable number of genes. Thus, depending on 
the status of the network, these genes may show an observable periodic pat- 
tern under one condition, and be too weak to detect under another condition. 
As a consequence of the combined effect of the variation in periodicity and 
experimental noise, each study could capture a different subset of the PE 
genes. The difference of the cell cycle length shown in Figure 7 in the supple- 
mentary material [Fan, Pyne and Liu (2009)], which could not be explained 
solely by microarray platform difference, is a further evidence of such vari- 
ation. For example, the cell cycle lengths in the posterior mode for the two 
cdc25 experiments in Rustici et al. are 135 and 138 minutes, while in Oliva 
et al. and Peng et al., this number increases to 164 and 173 minutes, re- 
spectively. Although they are using the same synchronization technique on 
the same organism, subtle environmental or physiological differences have 
changed the speed of the cell cycle oscillation. Therefore, it may have also 
changed relative amplitudes of oscillation of the genes leading to overall 
ranking discrepancy. 

4. Conclusion. In spite of the rapid rise in the number of microarray ex- 
periments, many of which address related issues, a systematic meta-analysis 
of such data is rarely attempted. We conducted a meta-analysis of ten fission 
yeast cell cycle genome-wide time-series experiments with a model-based 
Bayesian approach. Compared to other methods, key features of our model 
include the fixed relative phase of the peaking time of the genes across all 
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experiments (e.g., a gene will peak 10 degrees earlier than another gene in 
an experiment if and only if the same happens in another experiment) and a 
flexible amplitude for periodic components. Our approach does not require 
training sets to estimate important global parameters such as the period 
of cell cycle, but to infer them from all the data. Notably, our paramet- 
ric approach deals with phase shift, signal amplitude difference, noise level 
difference and de-synchronization automatically. Despite the high dimen- 
sionality, the implemented MCMC chain mixes well with the help of global 
moves. The residual analysis shows that our model fits the data well. 

A striking finding of our analysis is that more than 2000 genes are signif- 
icantly periodically expressed, which accounts for approximately 40% of all 
the genes in the fission yeast genome. The subset analysis suggests that this 
number may increase with more data included. This enhances greatly the 
current knowledge of only 10-15% of all fission yeast genes that are reported 
as periodically expressed during the cell cycle. Interestingly, genome- wide os- 
cillation has also been reported by recent studies on other cyclic phenomena 
in the cell, such as the metabolic cycle and circadian periodicity [Klevecz 
et al. (2004); Tu et al. (2005); Ptitsyn, Zvonic and Gimble (2007)]. Clearly, 
a certain amount of influence of the global cell cycle processes on most genes 
in the genome, in particular, in unicellular organisms such as fission yeast, 
cannot be ruled out. For instance, the folding and unfolding of chromosomes 
over the course of the cell cycle will have genome-wide incidental effect on 
transcription. However, earlier studies concede that limited ability to distin- 
guish precisely the weakly periodic oscillations from prevalent microarray 
noise only allowed conservative estimates of PE genes. By explicitly model- 
ing periodic and nonperiodic components, and different sources of variation 
and noise, our model-based approach helps to overcome this long-standing 
limitation. The resulting list of more than 2000 PE genes would allow the 
researchers to cast a much wider and deeper net for cell cycle regulated genes 
that can lead to investigation of novel or relatively less known gene modules 
and networks involved in the machinery of cell cycle regulation. 

It should be noted that the key idea behind our model is rather general. 
It can be applied to detect periodic patterns where the amplitude is noisy 
but the patterns are nonetheless consistent across different experiments. The 
data can be any collection of time series. A study of cell cycle data from other 
species, such as the budding yeast, mouse, human, etc., using the proposed 
method can be of immediate interest. 

One possible way to improve the current method is to employ a more 
robust error model, using, for example, ^distributions instead of Gaussians 
for the noise term [Hampcl et al. (1986); Lange, Little and Taylor (1989)]. 
But as a price to pay, the computational complexity may be increased sub- 
stantially. It should be noted that, as stated in Section 2.5.3, alternative 
Bayesian model selection methods may also be applied to this problem. For 
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example, Green (1995) provides a way to perform joint model selection and 
parameter estimation via reversible jump MCMC. It may be applicable to 
this problem if the efficiency of reversible jump MCMC moves can be im- 
proved significantly. The methods proposed by Chib (1995) and Chib and 
Jeliazkov (2001), which estimate the marginal likelihood of the data under 
a model, may also be a worthwhile direction to explore. 



A.l. Prior distribution. We assigned reasonably diffuse but still proper 
prior distributions for all parameters: 



O 9e ~iV(0,Ci), 

b ge ~N(0,C 2 ), 
c ge ~N(0,C 3 ), 
d ge ~Unif(0,C 4 ), 

A ge oc Exp (rate = C 5 ), 0<A ge <C 6 , 
fi e -Unif (C 7 ,C 8 ), 

V>1 OcN(0,Cg), -7r<^i<7T, 

ip e oc N(0,C 1Q ), e = 2,...,E,-ir< i/> e < vr, 

(j) g ~ Unif(-7r,7r), 

A e ~Unif(0,Cii), 

VgeKe ~ InV-^(C l2 , Ce), 

C e ~Exp(C7 13 ). 

The constants in the prior distributions are assigned correspondingly, mak- 
ing use of our prior knowledge: G\ = 1,C 2 = 0.005 2 , C 3 = 0.0001 2 , C 4 = 500, 
C 5 = 10, C 6 = 10, C 7 = 2tt/180, C 8 = 2vr/120, C 9 = 0.2 2 , C 10 = l 2 ,C n = 0.006, 
C , i2 = 4,Ci 3 = 50. 

A.2. Posterior distributions and Metropolis-within-Gibbs. We can write 
down the joint distribution of the data and parameters as 

P (y,e,$,r) 

=p(y|e,*,i>(e,$,r) 

" G ( E I S e \ 
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X P{age)p(bge)p(Cge)p(dge)p{A ge )p{cTg e \C 



\p{^e)p{^e)p{K)p{Ce) )■ 



e=l 



We assume that all missing data are missing completely at random, so 
their corresponding components are simply omitted from this expression. 
Again, we introduce the following symbols for convenience: 

Dget = Yg e t - a ge - b ge T et - c ge (mm(T et - d ge ,0)) 2 , 

R get = D get - Ag e COs(n e T et + tpe + 4>g)^ XeTet , 
Xget = (l,r et , [min(T et - d ge ,0)}), 



X, 



gel 



X, 



.X, 



geS e 



Zget = Yget - A ge cos(// e T et + ljj e + (j) g )e 



-A e T ei 



"gel 



z 



V 



J geS e 



J_ 

Ci 



J_ 

c 2 



J__ 
c 3 J 



From the joint distribution, we can get all full conditional posterior distri- 
butions: 



"ge 

V 



rest ~ N 



xLx, 



ge^'-ge 



+ v 



(J 



!:)'- 



-1 vT 7 / vT y 
^ge^ge / ^-ge^-ge 



a 



9'-' 



+ V 



a 



p(d ge \rest) oc -^-exp 
04 



Z^t=l -"-get 



A ge \rest oc iV(/i,cr 2 ), < A ge < C 6 



BAYESIAN META-ANALYSIS OF CELL CYCLE GENE EXPRESSION 



23 



where 



_ Ef=i cos{fi e T et + jj e + ^e-^Dget - a 2 ge C 5 
/ "~ Ef=i{cos(/i e T et + ^ e + ^)e-^} 2 

2 

2 _ °_9± 



G S e ( p2 



2 < 



C 7 </x e < C 8 , 



purest) oc — — T Texp<^ 

p(^|r eS 0«C 9 - a5 nne X p{-^-^|-), -7r<V» e <7r, fore = l, 
ff =it=i I 2 <V 2C 9J 

»(*.i™-)<xqrnn-p{-§-^}' 



-7r < ^ e < 7T> for e = 2, . . . , E, 



( 1 

f>(<£ s |resi) oc J 7 exp l "T^F P - 7r <</ , 9< 7r , 

g=lt=l I Z(J 9e) 

G S e , J^2 \ 

p(\ e \rest) oc JJ IJ ex p|-2^|' 0<A e <Cn, 



£,=14=1 9e 

cr^ e ~ /n?;-x 2 ( Se + Ci2, 



Cl2Ce + J2t=l R let 



S e + C- 



12 



C e ~Gamma^G + l,^g^- + Ci3^ 



>r- 

For conditional distributions which we only know up to a normaliza- 
tion constant, we used the Metropolis-Hastings algorithm to draw samples. 
When fitting the Mq model to a gene, the full conditional distribution of 
its parameters can be obtained by simply replacing all A ge with zero in the 
corresponding full conditional distribution from M\. 



A. 3. Advanced MCMC moves for better mixing. Besides the basic Metro- 
polis-within-Gibbs iteration, we insert the following moves to perturb the 
MCMC chain in order to help it traverse faster through the high dimensional 
space where there are many local modes and strong correlations among a 
group of parameters. 
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• Phase parameters ip e and (f> g are not identifiable in model M\ because the 
joint posterior distribution is invariant if we add a value to all ifi e and 
subtract the same value from all <fi g . One way to solve this nonidentifia- 
bility problem is to fix one of them, but it appears that the loss of one 
degree of freedom makes the chain very sticky, that is, slow to converge. 
As an alternative, we assign zero-centered normal prior distributions to 
all ip e , and use a transformation group move [Liu and Wu (1999); Liu and 
Sabatti (2000); Liu (2001)] to improve mixing of the MCMC sampler. 
Specifically, we first propose a move by adding a random number z to all 
ip e and subtracting z from all (j) g , and then use the Metropolis-Hastings 
rule to accept or reject this move. Since we only care about the relative 
phases of genes and experiments, we use <fi g + ip\ as the gene's relative 
phase and ip e — ip% as the phase for an experiment. 

• When a gene violates the assumption that its peaking time in the cell cycle 
relative to all other genes is fixed across different experiments, its multiple 
time series will show inconsistent phases, which leads to multiple modes 
for its phase parameter <fi g and amplitude parameters A ge . It is difficult to 
get out of this kind of local mode by updating (fi g and A ge separately and 
locally. We combine the idea of grouping [Liu, Wong and Kong (1994)] and 
Metropolized independence sampling [Hastings (1970); Liu (1996, 2001)] 
to deal with this kind of local mode. We call it the Metropolized indepen- 
dence group sampler (MIPS). We first propose a new (fi g independent of 
old <j> g , say, from its prior distribution or an approximation of its condi- 
tional posterior distribution. Then, we sample all A ge conditional on the 
new (fig . The Metropolis-Hastings rule is used to decide whether to accept 
this move or not. To get a good proposal of A ge , we use linear regression 
to get the least square estimate of A ge and use it as the center of the 
proposal distribution of A ge . 

• We again use MIPS to deal with the strong correlation within the trend 
parameters (a ge , b ge , c ge , d ge ) for a time series. The key is to propose a new 
d ge independent of the old d ge and sample (a ge ,b ge , c ge ) jointly conditional 
on the new d ge , which is a multivariate normal distribution here. 

• There are also strong correlations between A e and all A ge of the same 
experiment e. We still use MIPS to perturb the MCMC chain. We pro- 
pose a new A e independent of the old A e and sample all A ge of the same 
experiment e conditional on the new A e . Similar to the MIPS moves for 
(fi g and A ge of the same gene g, we used the least square estimate of A ge 
to improve the proposal efficiency. 

It should be noted that MIPS improves the mixing of the MCMC chain, 
especially at the initial state of the sampling, with an extra cost in com- 
putation. Our simulations indicated that this is a worthy effort. In meta- 
analysis, it is not unusual that different experiments support different values 



BAYESIAN META-ANALYSIS OF CELL CYCLE GENE EXPRESSION 



25 



for a shared parameter. As a result, the shared parameter may have a multi- 
modal distribution. In that case, strategies such as MIPS for making global 
moves are desirable. 
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SUPPLEMENTARY MATERIAL 

Various supporting materials (DOI: 10.1214/09-AOAS300SUPP; .pdf). In 
this supplement we provide model fitting diagnoses, hierarchical clustering 
results, the effect of data size on the statistical power, supporting evidences 
for newly found genes, and figures referred to in this paper. 
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